Briefings in Bioinformatics

Oxford University Press (OUP)

Preprints posted in the last 90 days, ranked by how well they match Briefings in Bioinformatics's content profile, based on 326 papers previously published here. The average preprint has a 0.25% match score for this journal, so anything above that is already an above-average fit.

1
Improved prediction of virus-human protein-protein interactions by incorporating network topology and viral molecular mimicry

Zhang, Z.; Feng, Y.; Meng, X.; Peng, Y.

2026-03-03 bioinformatics 10.64898/2026.02.28.708776 medRxiv
Top 0.1%
22.8%

The protein-protein interactions (PPIs) between viruses and humans play crucial roles in viral infections. Although numerous computational approaches have been proposed for predicting virus-human PPIs, their performance remains suboptimal and may be overestimated due to the lack of a benchmark dataset. To address these limitations, we first constructed a carefully curated benchmark dataset, ensuring non-overlapping PPIs and minimal sequence similarity of both human and viral proteins between the training and test sets. Based on this dataset, we developed vhPPIpred, a machine learning-based prediction method that not only incorporated sequence embedding and evolutionary information but also leveraged network topology and viral molecular mimicry of human PPIs. Comparative experiments demonstrated that vhPPIpred outperformed five state-of-the-art methods on both our benchmark dataset and three independent datasets. vhPPIpred also achieved high computational efficiency, requiring relatively low runtime and memory. Finally, vhPPIpred was demonstrated to have great potential in identifying human virus receptors and in inferring virus phenotypes, as the virus-human PPIs predicted by vhPPIpred can be used to effectively infer virus virulence. In summary, this study provides a valuable benchmark dataset and an effective tool for virus-human PPI prediction, with potential applications in antiviral drug discovery, host-pathogen interaction research and early warnings of emerging viruses.

2
Enhancing non-local interaction modeling for ab initio biomolecular calculations and simulations with ViSNet-PIMA

Cui, T.; Wang, Z.; Wang, T.

2026-03-20 bioinformatics 10.64898/2026.03.18.712561 medRxiv
Top 0.1%
22.7%

AI-based molecular dynamics simulation brings ab initio calculations to biomolecules in an efficient way, in which the machine learning force field (MLFF) plays a central role by accurately predicting molecular energies and forces. Most existing MLFFs assume localized interatomic interactions, limiting their ability to accurately model non-local interactions, which are crucial in biomolecular dynamics. In this study, we introduce ViSNet-PIMA, which efficiently learns non-local interactions through a physics-informed multipole aggregator (PIMA) and accurately encodes molecular geometric information. ViSNet-PIMA outperforms all state-of-the-art MLFFs for energy and force predictions of different kinds of biomolecules and various conformations on the MD22 and AIMD-Chig datasets, while adapting the PIMA blocks into other MLFFs further achieves 55.1% performance gains, demonstrating the superiority of ViSNet-PIMA and the universality of the model design. Furthermore, we propose AI2BMD-PIMA to incorporate ViSNet-PIMA into the AI2BMD simulation program by introducing a "Transfer Learning-Pretraining-Finetuning" scheme and replacing molecular mechanics-based non-local calculations among protein fragments with ViSNet-PIMA, which reduces AI2BMD's energy and force calculation errors by more than 50% for different protein conformations and protein folding and unfolding processes. ViSNet-PIMA advances ab initio calculation for entire biomolecules, amplifying the application value of AI-based molecular dynamics simulations and property calculations in biochemical research.

3
PathoResist AI: A One-Click Web Platform for Rapid Pathogen Resistance Analysis Based on the all_ratio Algorithm

Mai, G.; Dai, Y.

2026-02-13 bioinformatics 10.64898/2026.02.12.705264 medRxiv
Top 0.1%
22.6%

This study introduces a one-stop analysis platform named "PathoResistAI" (https://www.resistpath.com/), which can be used to solve the technical bottlenecks of pathogenic microorganism detection and antimicrobial resistance analysis. The platform is based on nanopore sequencing and the innovative all-ratio algorithm, which integrates four-dimensional parameters (sequence similarity, abundance, matching number, and matching length), significantly improving the detection accuracy of low-abundance pathogens and drug-resistance genes. The platform adopts four layers of modular design (input layer, core engine, dual-channel output, and visualization layer). Users only need to upload data in FASTQ format, and they can obtain automated reports, including pathogen identification and drug-resistance gene prediction within 30 min. The verification results show that the platform can accurately identify bacteria (e.g., Staphylococcus aureus and Serratia marcescens), viruses (e.g., Ebola virus), and drug-resistance genes (e.g., SdeY), which are consistent with the published literature results. Limitations include only supporting long-read sequencing data, small sample size (fewer than 50 cases), and lack of real clinical sample verification. In general, this platform represents the application and exploration of nanopore sequencing in the field of rapid detection of pathogenic microorganisms, and provides a new tool for microbial pathogen or AMR detection research.

4
Structure-aware graph attention based hierarchical transformer framework for drug-target binding affinity prediction

Kaira, V. S.; Kudari, Z. D.; P, S. S.; Bhat, R.; G, J.

2026-04-22 bioinformatics 10.64898/2026.04.19.719524 medRxiv
Top 0.1%
22.5%

Drug-target interaction prediction is significant in the hit identification phase of drug discovery, enabling the identification of potential drug candidates for downstream optimization. Traditional computational methods have some drawbacks in their ability to represent 3D structural data for both molecules and target proteins, which is required to capture the intricate protein-ligand interactions that regulate binding affinity. In this work, we propose a graph transformer-based model (GTStrDTI) that combines an intragraph attention mechanism with cross-modal attention to enrich the representation of both the drug molecule and target protein. This approach comprehensively models both intramolecular structural features and intermolecular interactions, thereby enhancing binding affinity prediction performance. A thorough evaluation on benchmark datasets such as KIBA, DAVIS, and BindingDB_Kd shows that our approach surpasses the state-of-the-art methods under challenging target cold-start settings. Our analysis found that adding a graph-based 3D structural representation of the protein target (C-alpha contact graphs from PDB with a threshold distance of 5 Å) and incorporating molecule adjacency information boosts predictive performance, thus contributing towards narrowing the gap between computational and experimental research.
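As an illustration of the C-alpha contact graph mentioned in this abstract, the following is a minimal sketch, not the authors' implementation. It assumes that two residues are linked when their C-alpha atoms lie within the threshold distance (inclusive), and the coordinates are made-up toy values rather than a real PDB entry; the function name `contact_graph` is hypothetical.

```python
import math


def contact_graph(ca_coords, threshold=5.0):
    """Build adjacency lists linking residues whose C-alpha atoms are
    within `threshold` angstroms of each other (inclusive cutoff is an
    assumption; the abstract only states the 5 angstrom threshold)."""
    n = len(ca_coords)
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= threshold:
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency


# Toy C-alpha coordinates in angstroms (hypothetical, for illustration only).
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (3.8, 3.0, 0.0),
]
graph = contact_graph(toy_coords)
for residue, neighbours in graph.items():
    print(residue, neighbours)
```

The pairwise loop is quadratic in the number of residues, which is usually acceptable for single proteins; a spatial index would be a natural replacement for very large structures.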

5
TF-IDF k-mer-based Classical and Hybrid Machine Learning Models for SARS-CoV-2 Variant Classification under Imbalanced Genomic Data

Haque, N.; Mazed, A.; Ankhi, J. N.; Uddin, M. J.

2026-04-02 bioinformatics 10.64898/2026.04.02.716024 medRxiv
Top 0.1%
22.3%

Accurate classification of SARS-CoV-2 genomic variants is essential for effective genomic surveillance, yet it is challenged by extreme class imbalance, limited representation of rare variants, and distribution shifts in real-world sequencing data. In this study, we employed a hybrid RF-SVM framework designed for robust detection of rare SARS-CoV-2 variants. It integrates a random forest and a polynomial-kernel support vector machine to enhance sensitivity to minority classes while maintaining overall predictive stability. We systematically compared classical machine learning models, deep learning approaches, and hybrid strategies under both standard and distribution-shifted evaluation settings. Our results show that classical models using TF-IDF-based k-mer features outperform deep learning methods on macro-averaged performance metrics. The Random Forest classifier using TF-IDF features achieved the best overall performance, with a macro-averaged F1-score of 0.8894 and an accuracy of 96.3%. The model also demonstrated strong generalization ability, as evidenced by stable cross-validation performance (CV accuracy = 0.9637). The hybrid RF-SVM model further improves rare variant detection under severe class imbalance. Calibration analysis indicates reliable probability estimates for common variants, although challenges persist for minority classes. Overall, this study highlights the limitations of deep learning in highly imbalanced genomic settings and demonstrates that carefully designed hybrid machine learning approaches provide an effective and interpretable solution for rare SARS-CoV-2 variant detection.
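To make the idea of TF-IDF-weighted k-mer features concrete, here is a minimal pure-Python sketch. It is not the authors' pipeline: the choice of k = 3, the smoothed IDF formula, the toy sequences, and the function names are all assumptions for illustration. Each sequence is treated as a document whose words are overlapping k-mers, and the resulting vectors are the kind of input a classifier such as a random forest would consume.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_kmer_vectors(sequences, k=3):
    """Compute TF-IDF vectors over k-mers, one vector per sequence."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted({km for d in docs for km in d})
    n_docs = len(docs)
    doc_freq = {km: sum(1 for d in docs if km in d) for km in vocab}
    # Smoothed IDF (an assumption): ln((1 + N) / (1 + df)) + 1.
    idf = {km: math.log((1 + n_docs) / (1 + doc_freq[km])) + 1.0 for km in vocab}
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1  # guard against sequences shorter than k
        vectors.append([d[km] / total * idf[km] for km in vocab])
    return vocab, vectors


# Toy sequences (hypothetical, not real SARS-CoV-2 genomes).
vocab, vectors = tfidf_kmer_vectors(["ACGTAC", "ACGAAA"], k=3)
print(vocab)
print(vectors[0])
```

K-mers shared by every sequence receive the lowest weight, so the features emphasise k-mers that discriminate between sequences.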

6
GCN-Mamba: Graph Convolutional Network with Mamba for Antibacterial Synergy Prediction

Su, H.; Liang, Y.; Xiao, W.; Li, H.; Liu, X.; Yang, Z.; Yuan, M.; Liu, X.

2026-03-12 bioinformatics 10.64898/2026.03.10.710738 medRxiv
Top 0.1%
18.9%

The escalating crisis of antimicrobial resistance necessitates novel therapeutic strategies, among which drug combination therapy shows great promise by enhancing efficacy and reducing toxicity. However, identifying effective synergistic pairs from the vast combinatorial space remains experimentally challenging and resource-intensive. To address this, we introduce GCN-Mamba, a deep learning framework that integrates Graph Convolutional Networks (GCN) with the Mamba State Space Model. This architecture captures both local molecular topological structures and global implicit interactions by leveraging Extended 3-Dimensional Fingerprints (E3FP) and bacterial gene expression profiles. Evaluation on a comprehensive dataset demonstrated that GCN-Mamba significantly outperforms classical machine learning models in predictive accuracy. In a targeted case study against Methicillin-resistant Staphylococcus aureus (MRSA), the model successfully rediscovered known synergistic pairs, such as Quercetin and Curcumin, consistent with recent literature. Furthermore, prospective in vitro validation confirmed a novel synergistic combination of Shikimic acid and Oxacillin, validating the model's practical utility. By efficiently prioritizing potential candidates, GCN-Mamba serves as a powerful and reliable tool for accelerating the discovery of synergistic antimicrobial combinations, effectively bridging the gap between computational prediction and experimental validation.

7
TriGraphQA: a triple graph learning framework for model quality assessment of protein complexes

Liang, L.; Zhao, K.

2026-03-20 bioinformatics 10.64898/2026.03.17.712533 medRxiv
Top 0.1%
18.5%

Accurate quality assessment of predicted protein-protein complex structures remains a major challenge. Existing graph-based quality assessment methods often treat the entire complex as a homogeneous graph, which obscures the physical distinction between intra-chain folding stability and inter-chain binding specificity. In this study, we introduce TriGraphQA, a novel triple graph learning framework designed for model quality assessment of protein complexes. TriGraphQA explicitly decouples monomeric and interfacial representations by constructing three geometric views: two residue-node graphs capturing the local folding environments of individual chains, and a dedicated contact-node graph representing the binding interface. Crucially, we propose an interface context aggregation module to project context-rich embeddings from the monomers onto the interface, effectively fusing multi-scale structural features. We conducted comprehensive tests on several challenging benchmark datasets, including Dimer50, DBM55-AF2, and HAF2. The results show that TriGraphQA significantly outperforms state-of-the-art single-model methods. TriGraphQA consistently achieves the highest global scoring correlations and lower top-ranking losses. Consequently, TriGraphQA provides a powerful evaluation tool for protein-protein docking, facilitating the reliable identification of near-native assemblies in large-scale structural modeling and molecular recognition studies.

8
CrossAffinity: A Sequence-Based Protein-Protein Binding Affinity Prediction Tool Using Cross-Attention Mechanism

Guan, J. S.; Wang, Z.; Mu, Y.

2026-02-23 bioinformatics 10.64898/2026.02.22.707318 medRxiv
Top 0.1%
18.4%

Protein-protein binding affinity is important for understanding protein interactions within a protein complex and for identifying strong drug-peptide binders to a target protein. Many structure-based models were built previously with reasonable performance. However, such models require the protein complex structure as input, which is usually unavailable due to high cost and experimental constraints. To tackle this issue, the sequence-based CrossAffinity model was constructed in this study, using a cross-attention module to extract contextual information of interacting protein components while separating the protein complex into two distinct parts to predict the protein-protein binding affinity. CrossAffinity outperformed all structure-based and sequence-based models on the S34 test set, which contains chronologically newer protein complex structures and binding affinity values, while being trained on an older dataset, showing generalisability to new data points. In other test sets, namely S90, S90 subset and S79*, CrossAffinity also outperformed all other sequence-based models while maintaining comparable performance to many recently published structure-based models. The acceptable performance and quick inference of CrossAffinity enable it to be deployed in situations requiring the prediction of the binding affinity of many protein complexes that lack structural information.

9
Predicting Pre-treatment Resistance or Post-treatment Effect? A Systematic Benchmarking of Single-Cell Drug Response Models

Shen, L.; Sun, X.; Zheng, S.; Hashmi, A.; Eriksson, J.; Mustonen, H.; Seppänen, H.; Shen, B.; Li, M.; Vähä-Koskela, M.; Tang, J.

2026-04-14 bioinformatics 10.64898/2026.04.10.717709 medRxiv
Top 0.1%
17.2%

Intratumoral heterogeneity is a major driver of variable drug responses in cancer. Single-cell RNA sequencing (scRNA-seq) enables the characterization of such heterogeneity, providing an opportunity to predict drug response at single-cell resolution. As a result, a growing number of computational models have been developed to infer drug response from scRNA-seq datasets. However, their performance, robustness, and generalizability across different biological contexts have not been systematically evaluated. To address this gap, we conducted a comprehensive benchmarking of representative single-cell drug response prediction models. Using 26 curated datasets comprising over 760,000 cells across 12 cancer types and 21 therapeutic agents, we constructed balanced and imbalanced scenarios to reflect more realistic distributions of drug response labels. To address the lack of ground-truth drug-response labels in conventional scRNA-seq datasets, we further incorporated lineage-tracing data with experimentally validated drug-response annotations, enabling model evaluation in a clinically relevant pre-treatment prediction setting. Our results show that across the tested methods, the prediction performance is markedly higher in cell lines than in tissue samples. Under imbalanced conditions, most methods exhibited sharp performance declines, whereas scDEAL demonstrated the highest robustness. Independent validation using an in-house pancreatic ductal adenocarcinoma dataset further confirms the robustness of scDEAL and its ability to capture biologically meaningful state transitions. A label-substitution experiment revealed that this robust performance is partially driven by the model's specific training label construction. However, the benchmarking with lineage-tracing data reveals a fundamental limitation: most models capture drug-induced transcriptional changes but struggle to predict a cell's intrinsic resistance state prior to treatment.
In summary, our study not only defines the performance boundaries of current approaches but also highlights their limitations in addressing intratumoral heterogeneity, extreme class imbalance, and the prediction of intrinsic cellular resistance, emphasizing the need for the development of next-generation single-cell drug response models with stronger clinical relevance.

10
VarDCL: A Multimodal PLM-Enhanced Framework for Missense Variant Effect Prediction via Self-distilled Contrastive Learning

Zhang, H.; Zheng, G.; Xu, Z.; Zhao, H.; Cai, S.; Huang, Y.; Zhou, Z.; Wei, Y.

2026-03-17 bioinformatics 10.64898/2026.03.13.711612 medRxiv
Top 0.1%
14.8%

Missense variants are a common type of genetic mutation that can alter the structure and function of proteins, thereby affecting the normal physiological processes of organisms. Accurately distinguishing damaging missense variants from benign ones is of great significance for clinical genetic diagnosis, treatment strategy development, and protein engineering. Here, we propose the VarDCL method, which ingeniously integrates multimodal protein language model embeddings and self-distilled contrastive learning to identify subtle sequence and structural differences before and after protein mutations, thereby accurately predicting pathogenic missense variants. First, leveraging sequence and structural information before and after mutations, VarDCL generates sequence-structural multimodal features via different language models. It incorporates both global and local perspectives of feature embeddings to provide the model with dynamic, multimodal, and multi-view input data. Additionally, a Self-distilled Contrastive Learning (SDCL) module was proposed to enable more effective information integration and feature learning, enhancing the model's ability to detect sequence and structural changes induced by mutations. Within this module, the multi-level contrastive learning framework excels at capturing information differences before and after mutations within the same modality; meanwhile, the feature self-distillation mechanism effectively utilizes high-level fused features to guide the learning of low-level differential features, facilitating information interaction across different modalities. The VarDCL framework not only ensures the model's capacity to learn dynamic changes pre- and post-mutation but also significantly improves cross-modal information interaction between sequence and structure, thereby remarkably boosting the model's performance in distinguishing pathogenic mutations from benign ones.
To validate the effectiveness of VarDCL, extensive experiments were conducted. The ablation study demonstrates that all key components of VarDCL contribute significantly. On an independent test set containing 18,731 clinical variants, VarDCL achieved an AUC of 0.917, an AUPR of 0.876, an MCC of 0.690, and an F1-score of 0.789, outperforming 21 state-of-the-art existing methods. Benchmark analysis shows that VarDCL can be utilized as an accurate and potent tool for predicting missense variant effects.
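For readers less familiar with two of the metrics reported above, the sketch below shows how the F1-score and the Matthews correlation coefficient (MCC) follow from confusion-matrix counts, using their standard definitions. The counts are invented toy numbers and have no relation to the VarDCL test set; the function name `f1_and_mcc` is hypothetical.

```python
import math


def f1_and_mcc(tp, fp, tn, fn):
    """Compute F1-score and MCC from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Toy counts (hypothetical): 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
f1, mcc = f1_and_mcc(tp=40, fp=10, tn=45, fn=5)
print(round(f1, 4), round(mcc, 4))
```

Unlike F1, MCC uses all four cells of the confusion matrix, which is why the two are often reported together for imbalanced variant sets.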

11
Constructing a Literature-Derived Database for Benchmarking Polygenic Risk Score Construction Methods with Spectral Ranking Inferences

Sebastian, C.; Yu, M.; Jin, J.

2026-03-03 genetic and genomic medicine 10.64898/2026.03.01.26347258 medRxiv
Top 0.2%
14.2%

Polygenic risk scores (PRSs) have emerged as a valuable tool for genetic risk prediction and stratification in human diseases. Over the past decade, extensive methodological efforts have focused on improving the predictive power of PRS, leading to the development of numerous methods for PRS construction. Benchmarking these methods has thus become an essential task for guiding future PRS applications. While studies have benchmarked subsets of these methods on specific phenotypes and cohorts, the resulting evidence remains fragmented, with a lack of work that comprehensively assesses the relative performance of the various PRS methods. In this study, we addressed this gap by systematically constructing a PRS method benchmarking database synthesizing published results from 2009 to 2025. We applied a spectral ranking inference framework with uncertainty quantification to rank 14 PRS methods that had been adequately compared against each other in the literature. We constructed rankings using two complementary sources: original method-development studies and applications/benchmarking studies. While the highest-ranked methods (LDpred2 and AnnoPred) and the lowest-ranked method (C+T) were consistently identified from both sources, the relative ordering of most methods showed moderate variability. We further constructed phenotype-specific rankings, providing more detailed insights into the robustness and phenotype-specific strengths of individual methods. Collectively, the overall and phenotype-specific rankings of the PRS methods, along with the curated benchmarking data from the literature, provide a dynamic and practical reference database that can be continually updated with emerging PRS methods and published benchmarking results to guide future PRS applications.

12
Deciphering Cell Cycle Dynamics and Cell States in Single-cell RNA-seq data with SPAE

Yi, J.; Liu, J.; Guo, P.; Ye, Y.-n.; Zhou, X.

2026-03-08 bioinformatics 10.64898/2026.03.05.709782 medRxiv
Top 0.3%
12.5%

Rapid advances in single-cell RNA sequencing (scRNA-seq) technology have enabled the investigation of gene expression changes at the single-cell level, particularly for elucidating the heterogeneity among cells and complex biological processes. This technique reveals subtle molecular differences within individual cells, thereby offering a unique viewpoint for the investigation of cell cycle progression, cellular differentiation, and disease pathogenesis. However, accurately identifying and analyzing cell cycle dynamics in scRNA-seq data remains challenging due to the complexity of the data and the subtle differences between cell states. To address this challenge, we developed the integrated Sinusoidal and Piecewise AutoEncoder (SPAE), an autoencoder-based piecewise linear model, for characterizing the cell cycle dynamics and cell states in scRNA-seq data. Compared with existing methods, SPAE demonstrates substantially improved accuracy and robustness in cell cycle characterization. Additionally, SPAE can accurately predict cancer cell cycle transitions and effectively facilitate the removal of cell cycle effects from gene expression data. SPAE is available for non-commercial use at https://github.com/YaJahn/SPAE.

13
Benchmarking Large Language Models for Predicting Therapeutic Antisense Oligonucleotide Efficacy

Wei, Z.; Griesmer, S.; Sundar, A.

2026-02-19 bioinformatics 10.64898/2026.02.17.706455 medRxiv
Top 0.3%
12.5%

Antisense oligonucleotides (ASOs) are a promising class of therapeutic drugs that can target and modulate genes associated with various diseases. This study benchmarks Large Language Models (LLMs) for predicting ASO therapeutic efficacy through a two-stage approach: (1) molecular embedding-based fine-tuning using SMILES representations, and (2) prompt engineering with zero-shot and few-shot learning using DNA sequences with target gene information. We evaluated general-purpose models (GPT-3.5-Turbo, LLaMA2-7B, Galactica-6.7B) and chemistry-specific models (ChemBERTa, Molformer, BERT) across three datasets: PFRED (522 sequences), openASO (1708 sequences), and ASOptimizer (1267 sequences). DNA sequence inputs with target gene information outperformed SMILES representations. GPT-3.5-Turbo achieved R² values of 0.6381 (PFRED) and 0.6340 (ASOptimizer) for few-shot prompting with k=3 examples. Code and datasets are available at: https://github.com/asundar0128/IndependentStudy

14
Benchmarking single-cell foundation models for real-world RNA-seq data integration

Han, S.; Sztanka-Toth, T.; Senel, E.; Elnaggar, A.; Patel, J.; Mansi, T.; Smirnov, D.; Greshock, J.; Javidi, A.

2026-04-21 bioinformatics 10.64898/2026.04.17.719314 medRxiv
Top 0.3%
12.0%

Single-cell foundation models enable reusable representations and streamlined analysis workflows, yet rigorous evaluation of their performance and robustness in real-world pharmaceutical settings remains underexplored. Here, we benchmarked leading single-cell foundation models (scGPT; scGPT_CP, a continually pretrained checkpoint of scGPT; scFoundation; scMulan; CellFM) against established baseline methods (scVI; Harmony) for data integration using over 1.5 million cells from clinical and preclinical samples. Performance was assessed using well-established and complementary metrics for technical correction and biological structure preservation. We further introduced robustness-oriented rankings to summarize metric trade-offs and quantify performance consistency across datasets and evaluation settings. Our findings show that fine-tuning improved technical correction performance; among the foundation models, fine-tuned scGPT_CP performed best. However, the baseline scVI was the top overall performer, ranking first by our multi-metric Leximax ranking and achieving the highest Pareto Front-1 hit. Collectively, our study provides practical insights for adapting foundation models to real-world drug design and development.
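The abstract mentions Pareto-front membership as one way to summarize metric trade-offs. The sketch below shows the general idea of a first Pareto front, assuming every metric is "higher is better"; it is not the authors' ranking procedure, and the method names and scores are invented placeholders.

```python
def dominates(a, b):
    """True if score vector `a` is at least as good as `b` on every
    metric and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the names of methods that no other method dominates."""
    return sorted(
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in scores.items()
            if other_name != name
        )
    )


# Hypothetical (technical correction, biological preservation) scores.
toy_scores = {
    "method_a": (0.9, 0.6),
    "method_b": (0.7, 0.8),
    "method_c": (0.6, 0.5),
}
front = pareto_front(toy_scores)
print(front)
```

Here `method_c` is dominated by both other methods, while `method_a` and `method_b` trade off one metric against the other and so both sit on the first front.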

15
SSPSPredictor: A Sequence and Structure based Deep Learning Model for Predicting Phase-Separating Proteins

Wang, T.; Liao, S.; Qi, Y.; Zhang, Z.

2026-04-01 bioinformatics 10.64898/2026.03.30.715224 medRxiv
Top 0.3%
10.6%

Liquid-liquid phase separation (LLPS) underlies the formation of biomolecular liquid condensates (also referred to as membraneless organelles, MLOs), which are essential for spatially organizing various biochemical processes within cells. Proteins that play a key role in driving condensate formation are termed phase-separating proteins (PSPs). Given that experimental identification of PSPs remains labor-intensive and time-consuming, multiple computational tools have been developed based on empirical features or deep learning. In this study, we propose SSPSPredictor, a novel multimodal predictive model for PSPs with folded or intrinsically disordered structures, leveraging the fusion of sequence information from the protein language model ESM-2 and structural insights from the graph neural network GVP. Compared with existing tools, SSPSPredictor achieves balanced performance in identifying endogenous PSPs, predicting relative LLPS propensities, and recognizing key regions that drive LLPS. Moreover, SSPSPredictor exhibits good interpretability in identifying driving regions along protein sequences, although no relevant supervision was provided during training. Further predictive analysis of the human proteome using SSPSPredictor reveals that the proportion of intrinsically disordered proteins (IDPs) undergoing LLPS is significantly higher than that of folded proteins. In addition, pathogenic variants, especially those located in disordered regions, exhibit higher LLPS propensity than other mutations, uncovering a link between LLPS and diseases at the amino acid level.

16
RareCapsNet: An explainable capsule networks enable robust discovery of rare cell populations from large-scale single-cell transcriptomics

Ray, S.; Lall, S.

2026-02-04 bioinformatics 10.64898/2026.02.02.703229 medRxiv
Top 0.3%
10.6%

In silico (downstream) analysis of single-cell data has attracted considerable attention from machine learning researchers in the last few years. Recent technological advances and increases in throughput open up new opportunities to discover rare cell types. We develop RareCapsNet, a capsule-network-based technique for identifying rare cells in large single-cell RNA-seq data. RareCapsNet aims to bring the advantages of capsule networks to the single-cell domain by identifying novel rare cell populations through marker genes derived from a human-interpretable reading of the lower-level (primary) capsules. We demonstrate the explainability of the capsule network for identifying novel markers that act as signatures of particular rare cell populations. A comprehensive evaluation on simulated and real single-cell data demonstrates the efficacy of RareCapsNet for finding rare populations in large scRNA-seq data. RareCapsNet not only outperforms other state-of-the-art methods in specificity and selectivity for identifying rare cell types, but also successfully extracts the transcriptomic signature of the cell population. We also apply RareCapsNet to a multi-batch dataset, where knowledge learned from one batch can be transferred to find rare cells in another batch without retraining the model. Availability and Implementation: RareCapsNet is available at https://github.com/sumantaray/RareCapsNet.

17
Structure-Based TCR-pMHC Binding Prediction and Generalization to Unseen Peptides

Abeer, A. N. M. N.; Roy, R. S.; Qian, X.; Yoon, B.-J.

2026-02-23 bioinformatics 10.64898/2026.02.21.707231 medRxiv
Top 0.3%
10.5%

The interaction between T-cell receptors (TCRs) and the peptide-bound major histocompatibility complex (MHC) intricately impacts the functional specificity of the T-cell-mediated adaptive immune response. Consequently, its implications for immunotherapy have driven a growing number of computational methods for TCR recognition, which have recently come to include structure-based approaches thanks to advances in protein structure modeling. Despite access to structural information of the predicted binding interface, graph neural network (GNN)-based TCR-pMHC binding specificity classifiers tend to show poor accuracy for samples with unseen peptides. In this work, we comprehensively assess the potential factors that critically impact the generalization performance of classifiers trained with computationally predicted structures. Specifically, our experiments focus on analyzing the sensitivity of such predictors to the interaction features in the TCR-pMHC interface and the structural uncertainty. Building on the analysis, we demonstrate how the design of classifier architecture with auxiliary training objectives can improve the generalization performance to novel peptides not yet seen during model training. Overall, our work highlights the challenges of unseen peptide generalization from different perspectives of the GNN-based classifier paradigm, showcasing the strengths and weaknesses of the current state-of-the-art approaches in the generalization landscape.

18
Fast and alignment-free flavivirus classification from low-coverage genomes

Shahid, A.; Ulrich, J.-U.; Kuehnert, D.

2026-02-20 bioinformatics 10.64898/2026.02.20.706982 medRxiv
Top 0.3%
10.4%

High genomic variability among viral species makes sequence classification highly dependent on multiple sequence alignment (MSA) methods, which are both computationally intensive and sensitive to data quality issues. To provide a more efficient and robust alternative, we developed DiCNN-UniK, a Dual-Input Convolutional Neural Network (DiCNN) utilizing unique k-mer signatures and universal k-mer libraries to generate novel and direct embeddings. Instead of relying on k-mer frequency patterns, DiCNN-UniK directly leverages k-mer embedding information, which provides a clear picture of local genomic context. This architecture is designed to handle full-length genomic sequences, overcoming the restrictive 512-token limit common in many genomic foundation models. Trained on Flaviviruses, our model shows high sensitivity, robustness, and reliability, achieving an accuracy of 99% on an independent test set. DiCNN-UniK is trained on full-genome data and is able to handle partial genomic sequences without preprocessing, maintaining high accuracy and precision with genomic coverage as low as 20%. DiCNN-UniK currently stands as the best available model for the classification of flaviviruses, offering a sensitive, robust, and reliable solution for sequence analysis under real-world genomic coverage and data quality scenarios.

19
Exploring protein conformational ensembles using evolutionary conditional diffusion

Cui, X.; Ge, L.; Yang, X.; Li, X.; Hou, D.; Zhou, X.; Zhang, G.

2026-01-30 bioinformatics 10.64898/2026.01.30.702768 medRxiv
Top 0.3%
10.4%

Protein conformational ensembles encode the dynamic landscapes underlying biological function, regulation, and allostery. Accurately reconstructing such ensembles while balancing the accuracy of conformational distributions against physical plausibility remains a fundamental challenge in structural biology, particularly when dynamic data are scarce. Here, we propose DiffEnsemble, a diffusion-based framework designed for modeling protein conformational ensembles. DiffEnsemble learns latent dynamical representations from static protein structures in the Protein Data Bank and uses the structural profile derived from the AlphaFold Protein Structure Database as conditional guidance during the diffusion process. Benchmarking on 72 protein targets from the ATLAS molecular dynamics simulation dataset demonstrates that DiffEnsemble outperforms existing methods, including BioEmu and AlphaFLOW. Compared with AlphaFLOW, DiffEnsemble achieves improvements of 28.9% and 11.3% in Pearson correlation coefficients for ensemble pairwise root mean square deviation and root mean square fluctuation, respectively. Importantly, DiffEnsemble successfully captures the dominant motions for 42% of the targets. These results demonstrate that latent dynamical information embedded in static structural data can effectively support the modeling of protein conformational ensembles.

20
PepHammer - a lightweight web-based tool for bioactive peptide matching and identification

Gronning, A. G. B.; Scheele, C.

2026-04-15 bioinformatics 10.64898/2026.04.13.718252 medRxiv
Top 0.3%
10.4%

Peptides are gaining increasing attention as therapeutic agents. Already, peptide-based therapeutics play a key role in the treatment of diverse diseases, including diabetes, obesity, and other complex disorders, and their clinical relevance is expected to expand further in the coming years. Technological and computational advances have substantially enriched peptidomics, massively increasing the scale and depth of peptide identification. As a result, increasingly large and information-rich datasets are now available for downstream analysis and experimental validation. However, the rapid expansion of peptidomics datasets also leads to a corresponding increase in search space, complicating the efficient identification of peptides relevant to specific biological or clinical questions. To address this challenge, we present PepHammer, a lightweight web-based tool for bioactive peptide matching and identification. PepHammer allows users to input up to 10000 peptides (2-150 amino acids in length) and compare them against extensive databases of peptides with predicted or experimentally validated bioactivities and tissue associations using Hamming distance, Grantham distance, as well as partial or exact matching strategies. Via an example study of human milk peptidomics, we demonstrate that PepHammer rapidly provides an overview of the bioactivity and tissue-relational landscape, serving as a starting point for downstream analyses. PepHammer thus enables efficient exploration of large-scale peptidomics datasets and facilitates the identification of biologically relevant peptides.
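To make the Hamming-distance matching mentioned above concrete, here is a minimal sketch of comparing query peptides against a reference list. It is not PepHammer's code: the function names, the `max_dist` parameter, and the three-letter toy strings are assumptions for illustration only and are not real bioactive peptides. Hamming distance is defined only for equal-length sequences, so the sketch groups the reference list by length first.

```python
def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def match_peptides(queries, database, max_dist=1):
    """For each query, return (distance, peptide) pairs within max_dist, closest first."""
    by_len = {}
    for pep in database:
        # Bucket reference peptides by length so only comparable pairs are scored.
        by_len.setdefault(len(pep), []).append(pep)
    hits = {}
    for q in queries:
        scored = ((hamming(q, p), p) for p in by_len.get(len(q), []))
        hits[q] = sorted(pair for pair in scored if pair[0] <= max_dist)
    return hits


database = ["GLP", "GIP", "AAAA"]   # toy reference strings, hypothetical
hits = match_peptides(["GLP", "GAA"], database, max_dist=1)
print(hits)
```

Bucketing by length keeps the number of comparisons down when the reference list is large. A substitution-aware measure such as Grantham distance would replace the `x != y` test with a lookup in an amino acid distance table.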